Shiny app link: https://groupa-autumnclass.shinyapps.io/GroupProject-Shiny/
The Covid event created new incentives for city dwellers to move from condos to detached either in or outside Toronto.
In this context we propose a simple tool for getting informed on the transactions that have happened in Toronto with the purpose of both:
estimating of the market value of a property in Toronto given location and other user defined features
and, given budget and other constraints,visualizing the properties sold.
At the moment, given the project time budget constraint, even though we have been able to web scrape the latest available listings on zoocasa but did not have sufficient time to clean and process the data. Instead, to illustrate the business case solution, we used a pre-existing dataset available as at H2 2019 which was alreadty cleaned and processed.
The current project builds upon a pre-existing project on Toronto housing price prediction available at https://github.com/slavaspirin/Toronto-housing-price-prediction
The dataset used from the pre-existing project is available at https://github.com/slavaspirin/Toronto-housing-price-prediction/raw/master/houses_edited.csv
The data from the pre-existing project is based on web scarping of Zoocasa listings of previously sold properties. Unfortunately the data does not have a time stamp. We understand that the primary listing data was scraped from https://www.zoocasa.com and contains a list of sold properties available as sometimes in H2 2019.
We were able to scrap a scoring of Toronto neighborhoods from https://torontolife.com/neighbourhood-rankings/ to complemet the average 2016 personal income data of each district that was already available in the pre-existing project.
Toronto neighborhoods data for geographical mapping was available from https://open.toronto.ca/dataset/neighbourhoods/.
We were also able to obtain location data for the Toronto subway stations from https://scruss.com/blog/2005/12/14/toronto-subway-station-gps-locations/#comments. For the line currently in construction, Line 5 Eglington, the subway stations latitude and longitude were obtained from https://en.wikipedia.org/wiki/Line_5_Eglinton.
The data available includes 15234 listings of Toronto properties with the following available features.
| Variable Name | Description |
|---|---|
| title | text, Zoocasa short description of the listing |
| final price | numeric,sale price |
| listed price | numeric, listed price |
| bedrooms | text ordinal, 0 beds, 0 + 1 beds, 1 beds … 9 + 5 beds |
| bedrooms>grade | numeric, number of bedrooms above grade |
| bedrooms<grade | numeric, number of bedrooms below grade |
| bathrooms | mumeric, 1 to 11 |
| sqft | Missing or numeric between 259 to 4374 |
| description | text, Zoocasa long description of the listed property |
| mls | text, zoocasa identifier |
| type | text categorical, Att/Row/Twnhouse, Comm Element Condo, Condo Apt, Condo Townhouse, Co-Op Apt, Co-Ownership Apt, Detached, Link, Plex, Semi-Detached, Store W/Apt/Offc |
| full link | text,Zoocasa web link |
| lat | numeric, property location latitude |
| long | numeric, property location longitude |
| city district | text, Toronto city district |
| district code | numeric, Toronto city district identifier code |
| mean district income | numeric, Toronto city district average household income based on 2016 statics |
Based on the “bedrooms>grade” and “bedrooms<grade” we created an aggregated bedrooms feature calculated as “bedrooms>grade”+bedrooms<grade/2 to account for the smaller size of the below grade bedrooms.
Based on the “listed price” and “final price” we created an “price differential” feature calculated as “final price”/“listed price” - 1. Even though such a feature is not necessarily useful for predicting the sale price, it is informative with respect to the pricing error that property sellers have encountered and may be informed upon with when listing a property.
Based on location data for the Toronto subway stations (including Line 5 Eglinton) we were able to estimate the closest subway station to a property and estimate, assuming and average walking speed of 5 km/h the walking distance to the closest subway station for each property.
| Variable Name | Description |
|---|---|
| district code | numeric,Toronto city district identifier code |
| area name | text,Toronto city district name |
| description | text,description of the district |
| housing score | numeric, score based on affordability (cost vs. income), appreciation (yoy change) and rate of home ownership |
| safety score | numeric, score based on number of crimes |
| transit score | numeric, score based on number of TTC stops, walk and transit scores, commuting times, numbers of commuters who walk, cycle or take TTC |
| shopping score | numeric, score based on number of groceries, markets and pharmacies per km2 |
| health score | numeric, score based on number of medical and mental health services per capita, number of senior care services per senior, number of people with family doctors and physical activity levels among residents |
| entertainment score | score based on numeric,number of gyms, sport facilities, bars and restaurants per km2 |
| community score | numeric, score based on voter turnout, community space use per capita, how many people report a sense of community belonging |
| education score | numeric, score based on number of schools per child, number of daycares per child, share of residents with post-secondary education |
| diversity score | numeric, score based on % of visible minorities , people whose mother tongues are not French or English, and first- and second generation immigrants |
| employment score | numeric, score based on employment and unemployment rates, the share of residents below the poverty line, the share of high income residents and the share of self employed residents |
The data pertaining to housing listings was cleaned,aggregated and readily available on https://github.com/slavaspirin/Toronto-housing-price-prediction/raw/master/houses_edited.csv
The histogram of sale prices and log sale prices for the most condos, townhouses, detached and semi-detached(the inclusion criteria captures 94% of our listings) indicate that the log transforms shift the distribution closer to a Gaussian one.
One can also observe that the sale relative to listed price has a positively skewed distribution when above zero, meaning that sellers were getting more than listed. Nevertheless, the very high values observed (50% mark up) make us less inclined to use the listed price in our analysis. It is possible there may be a bias related to increasing the marketability of the property and gathering more offers during the listing period.
An interesting property of condos is related to the subway walking distance feature. Condos are concentrated within less than 39 minutes to the closest subway station as the corresponding histogram abruptly drops around the 39 minutes threshold.
The Toronto district scores seem to be designed to be uniformly distributed, in contrast with the district average income which seems to be non-uniform.
Given the nature of the district average income having a different distribution that the districts scores we use the Spearman (rank) correlation.
One can observe the following strong correlations:
We have also computed the Pearson correlations and observed that the most extreme values was 0.83 (shopping vs. entertainment scores). As such, these correlations would not pose problems in a linear regression (i.e. the X’X matrix is invertible).
We have computed the Pearson correlations to observe how all property and location features are interconnected. As expected, the size of the apartment influences the sale price. In terms of district location, district average income and the sores for safety, diversity and employment are also significantly correlated (1% p-value test) with the sale price.
Unexpectedly, the “subway walking distance” to has a negative, small correlation with the final sale price. One of the reasons could be related to differences in this relationship across types of properties:
for condos, subway closeness is much more important than for detached,
for detached houses, the size of the property, and implicitly the price, is higher, the further away one gets from the subway lines.
We note that the most extreme correlation values are below 0.85 (excluding the “bedroom>grade” vs “Bedrooms Agg” correlation). As such, these correlations would not pose problems in a linear regression (i.e. the X’X matrix is invertible).
The following mappings attempt to provide an overview of the listings available as wel as the relationship between districts and house prices.
One can observe that the number of listings not uniformly concentrated across Toronto. With respect to housing price levels, both the detached and condo market seem to be exhibiting a similar geographical distribution with the district average income, confirming the positive correlation between prices and district average income.
To predict the house prices we use the data set mentioned in the data sources section as our training data to build a machine learning model. Once the model is built and evaluated we will deploy this to the shiny app for predicting the price of an user input house. Given the time constraint we did not include the “walking distance to subway” feature which would have required more time for implementing in the shiny app.
Linear regression is a basic and very commonly used predictive algorithm. Two main types of information can be obtained with linear regression, the extent at which a set of independent variables (X) can predict a dependent variable (Y), and which of the variables are significantly related to Y.
In it’s simplest form a linear regression can be described by the formula y = c + b * x, where y is the dependent variable, c is a constant (y-intercept), b is the regression coefficient (slope), and x is the independent, or predictive variable. Linear regression assumes linearity between X and the mean of Y, homoskedasticity, meaning that the variance of the residual is the constant and is independent of Y and the predictive variables X.
Linear regression can accommodate both numerical and categorical explanatory variables (the categorical predictors are transformed into dummy variables).
In this section we will build a multiple linear regression model for predicting the response - final_price, regressed on the remaining predictor variables in the data set. We will use the least squares minimization approach in this project.
We removed the variables from our data frames that are not relevant to the project.
Based on our exploratory data analysis, the detached and non-detached markets show distinctively separate dynamics in their behavior in the context of this project. For e.g detached houses show large variations in the price, the absolute price ranges are wide and higher and correlation between other variables show different trends. So to optimize our results we approach by developing 2 separate models - one for the detached market and non-detached market. Note: The non-detached market includes all other types of housing including semi-detached, condos, freeholds, townhouses etc.
We see that the predictor sqft is missing data for many observations. We will impute this missing data using multivariate imputation by chained equations (MICE) technique. We use the “mice” package in R for this purpose and use the Random Forest method with default imputation cycles (5).
To make the best estimations of the missing sqft values, we run the imputation process separately on the detached and non-detached houses.
Perform a sanity check of imputation and then apply the imputed values to the dataframe.
We can also apply the semi-parametric imputation approach Predictive Mean Matching (PMM) for the same.
PMM calculations involve several steps. First, a linear regression model is estimated to predict the feature Y that contains the missing values, whereby only the observed instances are used. Second, the predicted values for the observed and missing values are calculated. Third, for each instance where Y is missing, the closest predicted values among the observed instances are found, and finally, one of these close instances is randomly chosen and the missing value is imputed with the observed value from this close instance.
PMM also allows for discrete target variables, and has been proven to be a versatile and robust method.
It is important to understand the correlations between the variables, including both predictors and response to improve the accuracy of the model.
The variable correlations available in the descriptive statistics as well as the correlations matrix for our detached and non-detached, below, show that certain groups of variables are strongly correlated but not as high as to pose problems of multi-colinearity.
In addition, we address indirectly the strong correlation issues through a variable selection approach as described into the next section.
We started with the method of ** backward selection** i.e start with all the variables in the model and remove the variables that shows least statistical significance (p-values).
We added variables previously removed based on their observed relationship with the sale price. Thus we will eventually be adopting a mixed selection approach for variables.
All categorical variables are be factored using the default dummy variable encoding in R.
We used the Cook’s distance to detect observations that strongly influence the model. We also looked at the leverage vs. the residual plots to identify outliers or observations which makes a significant impact at the cost of reducing the overall accuracy.
During the iterative model building process we found that in the detached housing segment, properties with sale price of over $4M have very few observations but have a bigger impact by reducing the overall accuracy of the model. So we treat those observations as outliers and remove them when we train the model.
Two important assumptions in a linear regression model are that relationship between the predictors and response are additive and linear. This is almost never the case in real use cases as in our project. This shortcoming can be partially addressed by transforming numerical variables into categorical ones but it depends on a subjective approach or business information how the binning should be conducted.
Also, to recognize the non-linear effect due to interactions between the explanatory variable, we incorporated interaction effects between variables. We have also introduced non-linear effects where a non-linear relationship was observed between a predictor and the sale price.
We test a range of effects and finally retain the best effects in the model that provide significant improvement to the model fit.
Distribution of final sale price across all the markets.
Most of the variance is due to the detached market .
The proposed model recognizes the non-linear effect due to interactions between the explanatory variable through the interaction effects between variables.
We have also incorporated non-linear effects based on the non-linear relationship was observed between a predictor and the sale price.
##
## Call:
## lm(formula = final_price/1000 ~ ., data = detached_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1445.99 -138.77 -7.39 112.60 2128.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.031e+04 3.019e+03 -6.728 1.94e-11 ***
## bathrooms 1.416e+02 1.204e+01 11.757 < 2e-16 ***
## sqft -1.691e-01 1.831e-02 -9.237 < 2e-16 ***
## parking 1.955e+01 2.960e+00 6.604 4.48e-11 ***
## long -2.345e+02 3.806e+01 -6.162 7.87e-10 ***
## mean_district_income -1.623e-02 1.028e-03 -15.788 < 2e-16 ***
## bedrooms_ag 2.108e+02 1.471e+01 14.332 < 2e-16 ***
## housing -2.598e+00 2.242e-01 -11.584 < 2e-16 ***
## safety 9.681e-01 1.940e-01 4.991 6.25e-07 ***
## transit 2.151e+00 2.804e-01 7.670 2.11e-14 ***
## entertainment 1.811e+00 2.502e-01 7.239 5.35e-13 ***
## community -5.399e-01 1.962e-01 -2.752 0.00595 **
## education 8.143e-01 1.836e-01 4.435 9.46e-06 ***
## sqft_x_bathrooms 6.798e-02 4.762e-03 14.274 < 2e-16 ***
## ibedrooms_ag_x_bathrooms -5.041e+01 4.167e+00 -12.098 < 2e-16 ***
## sqft_x_mean_district_income 1.811e-06 1.446e-07 12.530 < 2e-16 ***
## bathrooms__squared 3.211e+00 2.611e+00 1.230 0.21885
## mean_district_income__square_root 1.214e+01 5.425e-01 22.375 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274.1 on 4271 degrees of freedom
## Multiple R-squared: 0.8129, Adjusted R-squared: 0.8122
## F-statistic: 1092 on 17 and 4271 DF, p-value: < 2.2e-16
The proposed model recognizes the non-linear effect due to interactions between the explanatory variable through the interaction effects between variables.
##
## Call:
## lm(formula = final_price/1000 ~ . + sqft:mean_district_income +
## sqft:type + bedrooms_ag:bathrooms + sqft:bathrooms + I(mean_district_income^0.5),
## data = nondetached_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1574.2 -81.8 -9.9 69.2 3699.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.693e+04 4.793e+03 7.706 1.41e-14 ***
## bathrooms 5.649e+00 6.891e+00 0.820 0.412394
## sqft -3.339e-01 2.650e-02 -12.597 < 2e-16 ***
## parking 1.911e+01 2.558e+00 7.472 8.48e-14 ***
## typeCo-Op Apt -4.501e+02 9.599e+01 -4.689 2.78e-06 ***
## typeCo-Ownership Apt -7.323e+02 1.254e+02 -5.839 5.41e-09 ***
## typeComm Element Condo -6.248e+02 5.089e+01 -12.277 < 2e-16 ***
## typeCondo Apt -6.566e+02 3.487e+01 -18.829 < 2e-16 ***
## typeCondo Townhouse -3.671e+02 3.813e+01 -9.629 < 2e-16 ***
## typeLink 1.863e+02 1.489e+02 1.251 0.211013
## typePlex 5.645e+02 7.471e+01 7.556 4.50e-14 ***
## typeSemi-Detached -2.542e+01 3.683e+01 -0.690 0.490063
## typeStore W/Apt/Offc -3.706e+02 1.713e+02 -2.163 0.030532 *
## lat -6.914e+02 7.300e+01 -9.472 < 2e-16 ***
## long 8.963e+01 2.657e+01 3.373 0.000746 ***
## mean_district_income -1.389e-02 4.412e-04 -31.479 < 2e-16 ***
## bedrooms_ag 1.929e+01 6.231e+00 3.095 0.001972 **
## bedrooms_bg 8.336e+00 3.364e+00 2.478 0.013219 *
## housing -1.327e+00 1.022e-01 -12.983 < 2e-16 ***
## safety 2.469e-01 8.838e-02 2.794 0.005218 **
## transit 1.802e+00 1.319e-01 13.662 < 2e-16 ***
## shopping 7.394e-01 1.459e-01 5.068 4.09e-07 ***
## health 1.563e-01 7.708e-02 2.027 0.042671 *
## entertainment 1.385e+00 1.637e-01 8.461 < 2e-16 ***
## diversity 1.146e+00 1.605e-01 7.142 9.76e-13 ***
## education -3.488e-01 8.500e-02 -4.104 4.09e-05 ***
## I(mean_district_income^0.5) 7.654e+00 2.582e-01 29.642 < 2e-16 ***
## sqft:mean_district_income 4.141e-06 9.925e-08 41.727 < 2e-16 ***
## sqft:typeCo-Op Apt 1.126e-01 9.120e-02 1.235 0.216827
## sqft:typeCo-Ownership Apt 2.765e-01 1.688e-01 1.637 0.101575
## sqft:typeComm Element Condo 3.990e-01 4.964e-02 8.037 1.02e-15 ***
## sqft:typeCondo Apt 4.333e-01 2.093e-02 20.703 < 2e-16 ***
## sqft:typeCondo Townhouse 1.539e-01 2.359e-02 6.525 7.10e-11 ***
## sqft:typeLink -9.640e-02 8.729e-02 -1.104 0.269494
## sqft:typePlex -1.812e-01 3.424e-02 -5.292 1.23e-07 ***
## sqft:typeSemi-Detached 6.044e-02 2.110e-02 2.864 0.004188 **
## sqft:typeStore W/Apt/Offc 1.945e-01 8.760e-02 2.220 0.026435 *
## bathrooms:bedrooms_ag -1.210e+01 2.021e+00 -5.990 2.17e-09 ***
## bathrooms:sqft 7.991e-02 5.843e-03 13.675 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 178.7 on 10832 degrees of freedom
## Multiple R-squared: 0.7766, Adjusted R-squared: 0.7758
## F-statistic: 990.8 on 38 and 10832 DF, p-value: < 2.2e-16
For both the detached and non-detached segments the model diagnostics statistics shows that there is still a remaining non-linear relationship between the estimated level of the sale price and the model estimated residuals as shown in the “Residuals vs fitted” and the “Scale-Location” plots where the volatility of the residuals increases as the estimated level of the sale price increases.
In addition, one can observe from the “Normal q-q” plots that the residuals are not normally distributed, with extreme divergence beyond +/- 2 stdev.
Also the residuals standard errors are quite high, around 270 thousand CAD for detached and 180 thousand CAD for non-detached, sizes which are acceptable for an average property price of 2 million CAD but less acceptable for a property valued at around 700 thousand CAD.
As this is a first, time-constrained attempt at modeling property sale price data, we choose to implement these model for illustrative purposed in our shiny app. On the other hand, we are fully aware that more research needs to be done on model selection, especially if we are aiming to find a model that is easy to implement for on-the-fly evaluation.
We have already evaluated the accuracy of coefficients while iterating through the model building process and selected the variables with significant statistical importance determined by the p-values.
Note: When interaction effects are employed, some of the base variables (main effects) may show reduced significance but that does not reduce the importance of the variable if the interaction effects themselves rank high. The base variables cannot be removed in such cases as per the hierarchical principle.
We evaluate the accuracy of the models in the below steps using MSE and R2 statistics based on an out-of-sample estimation test. We split the detached and non-detached data we used into 2 training and test data sets respectively. We retrain the same models using the new training subset and test them on the test data.
MSE Statistic for the sale price estimate on 20% of the dataset based on a model trained on the other 80% of the dataset
## [1] "252.6 thousand CAD"
R2 Statistic for the sale price estimate on 20% of the dataset based on a model trained on the other 80% of the dataset
## [1] "80.4 %"
MSE statistic for the sale price estimate on 20% of the dataset based on a model trained on the other 80% of the dataset
## [1] "145.5 thousand CAD"
R2 statistic for the sale price estimate on 20% of the dataset based on a model trained on the other 80% of the dataset
## [1] "75.1 %"
In our project, we used the random forest model as a comparison to the linear regression model with respect to the selection of explanatory variables, estimation errors and model fit.
Random Forest is an ensembling machine learning algorithm which consists in generating multiple decision trees based on random sampling of the data and the predictor variables and then combining their output. for every explanatory variable a classification is generated. Based on belonging to specific classes of a randomly selected set of variable the decision three is built. The user decides how many variables are used in constructing the decision tree. The construction of the classes for each variables accommodates both categorical and numerical variables (continuous or discontinuous). The random sampling serves to de-correlate the trees and subsequently reduce the Variance by averaging them and avoid overfitting. The user decides how many trees are used for model averaging. The random forest model needs to be tuned with respect to the number of decision trees and the number of variables randomly sampled at each stage.
One can observe that the distance to subway, number of bathrooms and average district income occur across all random forest models.
For detached, district scores on community, education and transit appear in addition to the above
For condos, district scores on shopping and health appear in addition to the above.
Random forest R-squared and MSE as a function of the number of trees. For each model the number of variables randomly sampled are selected based on maximizing the R2 and MSE comparison across across a 1 to 15 range.
The model and the shiny app could be used by a real estate company to provide potential customers with a self-serving tool that they can use prior to contacting the company.
The shiny app published at https://groupa-autumnclass.shinyapps.io/GroupProject-Shiny/ has two features available for the customer:
the first feature enables, potential sellers to obtain a ballpark figure value of what their house might be worth on the market.
the second feature enables, potential buyers to visualize properties sold, based on their specific criteria (available budget, required square footage, number of bedrooms/bathrooms, etc.).
The tool has potential for improvement in three directions: model estimation, additional explanatory factors and improving the shiny app.
For both the detached and non-detached segments the model diagnostics statistics show that there is still a remaining non-linear relationship between the estimated level of the sale price and the model estimated residuals.Also, the residuals standard errors are quite high, around 270 thousand CAD for detached and 180 thousand CAD for non-detached, sizes which are acceptable for an average property price of 2 million CAD but less acceptable for a property valued at around 700 thousand CAD.
The model can be enhanced by:
In it’s current version the district-related predictors (neighborhood rankings) are not necessarily accurate representations of the attractiveness of the immediate neighborhood for the buyer as Toronto districts are relatively large and heterogeneous. Each individual customer has different preferences in terms of what neighborhood criteria are more or less important to them.
Given the time constraint we did not include the “walking distance to subway” feature as it required more time for implementing in the shiny app. The feature was detected as important by the random forest algorithm and should be included at the nest project iteration.
Finally, if environmental parameters, such as closeness to large parks and Lake Ontario, as well as neighborhood quietness and distance to major arterial roads (noise/air pollution) could be included, this tool would be potentially very valuable.
Potential improvements we have already discussed but did not have time to implement:
adding the estimated price standard-deviation when estimating the price for a property
adding a sales price histogram to visualize the range of prices when visualizing the properties sold, based on user-defined criteria
adding a filter option based on a geographical radius around an address on the map for the user to visualize the properties within a certain radius
adding subway lines and the subway proximity criteria